feat: support Spark expression slice #4149

Open
andygrove wants to merge 3 commits into apache:main from andygrove:feat/array-slice

@andygrove
Member

Which issue does this PR close?

Closes #.

Rationale for this change

Add native support for Spark's slice(array, start, length) expression so it runs on Comet instead of falling back to Spark.

The datafusion-spark crate already ships a SparkSlice, but it is not Spark-compatible: when a negative start lies before the beginning of the array (e.g. slice([a], -2, 2)), it returns the first element instead of an empty array. We can upstream the fix later; for now this PR ships a Comet-local implementation.

What changes are included in this PR?

  • native/spark-expr/src/array_funcs/array_slice.rs: new SparkArraySlice UDF (spark_array_slice) implementing Spark's slice semantics, including 1-based indexing, negative-start-from-end, error on start = 0 or length < 0, and clamping length to the array end. Supports both List and LargeList element storage.
  • native/spark-expr/src/comet_scalar_funcs.rs: register the new UDF.
  • spark/src/main/scala/org/apache/comet/serde/arrays.scala: CometSlice serde casts the start/length args to Long and serialises a call to spark_array_slice, promising containsNull = true to match DataFusion's list nullability.
  • spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: register Slice in arrayExpressions.
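
The slice semantics listed above can be sketched on a plain Rust slice. This is a minimal illustration, not the code in array_slice.rs: the function name spark_slice_sketch, its signature, and the error strings are hypothetical, and the real UDF operates on Arrow List / LargeList arrays rather than on &[T].

```rust
/// Hypothetical model of Spark's slice(array, start, length) on a Rust slice.
/// Not the PR's SparkArraySlice; names and error messages are illustrative only.
fn spark_slice_sketch<T: Clone>(data: &[T], start: i64, length: i64) -> Result<Vec<T>, String> {
    // Spark indices are 1-based, so a start of 0 is an error.
    if start == 0 {
        return Err("start must not be 0 (indices are 1-based)".to_string());
    }
    // A negative length is an error as well.
    if length < 0 {
        return Err("length must be greater than or equal to 0".to_string());
    }

    let n = data.len() as i64;

    // Convert start to a 0-based index; a negative start counts from the end.
    let begin = if start < 0 { start + n } else { start - 1 };

    // A start past the end, or a negative start that reaches before the
    // beginning of the array, yields an empty array rather than a clamped one.
    if begin < 0 || begin >= n {
        return Ok(Vec::new());
    }

    // Clamp the requested length to the end of the array (saturating to avoid overflow).
    let end = begin.saturating_add(length).min(n);
    Ok(data[begin as usize..end as usize].to_vec())
}
```

Under this sketch, spark_slice_sketch(&["a"], -2, 2) returns an empty vector, which is the case the description says the upstream SparkSlice gets wrong.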

How are these changes tested?

  • 12 native unit tests in array_slice.rs covering positive / negative / zero / overflowing start, length 0, length past end, null inputs, empty arrays, and the error cases.
  • New SQL test file spark/src/test/resources/sql-tests/expressions/array/slice.sql covering all-literal, column + literal, and column-only argument combinations across boolean, tinyint, smallint, int, bigint, float, double, decimal, date, timestamp, timestamp_ntz, string, and nested array element types, plus the negative-start-overflow case that exposed the upstream bug.

@andygrove andygrove marked this pull request as ready for review April 30, 2026 01:03
@andygrove andygrove moved this from Todo to In progress in Comet Development May 13, 2026
@andygrove andygrove added this to the 0.17.0 (June 2026) milestone May 13, 2026
Contributor

@kazuyukitanimura kazuyukitanimura left a comment

Pending conflict resolution.

@comphead
Contributor

Filed apache/datafusion#22400 to fix edge case in upstream

# Conflicts:
#	native/spark-expr/src/comet_scalar_funcs.rs